DSBA Club

Gus Lipkin

10/29/22

Topics may include…

  • A bit about me
    • Me
    • My time at Poly
    • My current role
  • The main event
  • Some other stuff

A bit about me

Me

  • I like asking questions
  • I have a cute cat
  • I’m from Massachusetts
  • I self-host my own server for data storage and running a pi-hole
  • I have three bikes, a tandem, and three unicycles

My time at Poly

  • Data Science -> Business Analytics for business/management classes
  • Concentrations in…
    • Quantitative Economics & Econometrics
    • Intelligent Mobility
  • Capstone project with Tallahassee Memorial Healthcare

My current role

Data Analyst at Publix Supermarkets

  • Since just after graduation in May
  • Working on…
    • Shrink Reports (what goods couldn’t be sold and why)
    • Contribution Margin (💲profit💲)

The Main Event

A real-world problem

Overview

We want to know the cost to ship items to stores at the weekly store item level. That is, how much does it cost to ship the same item to different stores, or different items to the same store, with each scenario covering a week’s worth of costs? ($/store/item)

We have two tables that describe warehouse to store distribution. One contains information on trips a truck takes to deliver to stores on its planned route. The second contains information on items shipped to each store.

For each trip, there may be many stores. For each store, there may be many items.

Tables

How would you connect these two tables so that each item is assigned a cost from the correct trip?

TRIPS

  • TRIP_CODE (chr)
  • DATE (date)
  • ORIGIN_WHSE (int)
  • STOP (int)
  • ORDER_NUMBER (chr)
  • SHIP_COST (dbl)

WHSE

  • FILL_CODE (chr)
  • DATE (date)
  • SHIP_WHSE (int)
  • STOP (int)
  • INVOICE_NUMBER (chr)
  • ITEM_ID (int)
  • ITEM_CUBE (dbl)

Exact Matching

TRIPS

DATE

ORIGIN_WHSE

STOP


==

==

==

WHSE

DATE

SHIP_WHSE

STOP

A needed explanation

TRIPS


WHSE

DATE

==

DATE

Because we’re aggregating at a weekly level, matching by the smallest interval possible is okay so long as it’s shorter than a week.

A needed explanation

TRIPS


WHSE

DATE

==

DATE

ORIGIN_WHSE

==

SHIP_WHSE

While these are not always equal, we can match many rows and then figure out what to do with mismatches later.

A needed explanation

TRIPS


WHSE

DATE

==

DATE

ORIGIN_WHSE

==

SHIP_WHSE

STOP

==

STOP

There should always be a matching store number.

A needed explanation

TRIPS


WHSE

DATE

==

DATE

ORIGIN_WHSE

==

SHIP_WHSE

STOP

==

STOP

What about the leftover rows? This method only matched ~66% of the items to trips.

Matching WHSE type

TRIPS

DATE

ORIGIN_WHSE
%/% 100

STOP


==

==


==

WHSE

DATE

SHIP_WHSE
%/% 100

STOP

A needed explanation

TRIPS


WHSE

DATE

==

DATE

ORIGIN_WHSE
%/% 100

==

SHIP_WHSE
%/% 100

A WHSE_ID (XYZZ) is made up of three parts. The first digit denotes that the facility is a warehouse, the second the warehouse type, the last two the unique ID. Matching the first two digits will give you matching warehouse types. Produce/produce, frozen/frozen, HV/HV, LV/LV, etc.

STOP

==

STOP

A needed explanation

TRIPS


WHSE

DATE

==

DATE

ORIGIN_WHSE
%/% 100

==

SHIP_WHSE
%/% 100

STOP

==

STOP

This still leaves about 10% of the items without costs.

Forgetting about warehouses

TRIPS

DATE

STOP


==

==

WHSE

DATE

STOP

Every item still needs a cost allocated, but since we have no way to link a trip and an item, we’ll just have to hope for the best and make do with what we have.

Summing up

  1. Match on DATE, STOP, and ORIGIN_WHSE/SHIP_WHSE
  2. Match on DATE, STOP, and derived warehouse type
  3. Match on DATE and STOP

Designing a pipeline

A generic guide

Designing a pipeline

  1. Examine your source data
  2. Envision your end data format
  3. Design backwards then code forwards

Designing a pipeline

How can we go from this…

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

…to this?

Examine the data

head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Envision the end data format

# A tibble: 6 × 3
  cyl   name  value
  <fct> <chr> <dbl>
1 4     disp  105. 
2 4     mpg    26.7
3 6     disp  183. 
4 6     mpg    19.7
5 8     disp  353. 
6 8     mpg    15.1

Working backwards

library(tidyverse)

mtcars |>
  ggplot() +
    geom_bar(aes(x = cyl, y = value, fill = name), 
             stat = "identity", position = "dodge")

Working backwards

library(tidyverse)
  
mtcars |>
  mutate(cyl = as.factor(cyl)) |>
  ggplot() +
    geom_bar(aes(x = cyl, y = value, fill = name), 
             stat = "identity", position = "dodge")

Working backwards

library(tidyverse)

mtcars |>
  group_by(cyl) |>
    summarise(disp = mean(disp),
              mpg = mean(mpg)) |>
    ungroup() |>
  mutate(cyl = as.factor(cyl)) |>
  ggplot() +
    geom_bar(aes(x = cyl, y = value, fill = name), 
             stat = "identity", position = "dodge")

Working backwards

library(tidyverse)

mtcars |>
  group_by(cyl) |>
    summarise(disp = mean(disp),
              mpg = mean(mpg)) |>
    ungroup() |>
  pivot_longer(cols = c(disp, mpg),
               names_to = "name",
               values_to = "value") |>
  mutate(cyl = as.factor(cyl)) |>
  ggplot() +
    geom_bar(aes(x = cyl, y = value, fill = name), 
             stat = "identity", position = "dodge")

2022-10-29 DSBA Club

Gus Lipkin